The quality of the Nanostring samples varies considerably, depending on the quantity of RNA present, extraction quality, nCounter factors, hybridization success, etc. The Nanostring method includes some internal controls that allow us to exclude some samples, but this is a fairly crude and not very stringent filter. More samples need to be removed than the controls flag. Up to now, I have been using a somewhat ad-hoc approach, removing samples either on the basis of the number of genes expressed (which is biased where we are also analysing the expression of those same genes) or by simply removing outliers. Now it’s time to devise a better approach. As Michalis suggested, I will be using the “housekeeping” genes which are used to normalise the samples.
-Via expression levels of housekeeping genes
-Via relative levels of housekeeping genes (we should expect consistent ratios)
So we’ll stay with the old (sum-based) normalisation method
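The notes do not spell out what the sum-based method does, so the following is only a minimal Python sketch under one assumption: each sample is rescaled so that the sum of its housekeeping counts matches the average housekeeping sum across samples. The function name `sum_normalise`, the dictionary layout and the gene names are hypothetical.

```python
def sum_normalise(counts, housekeeping):
    """Rescale each sample so its housekeeping total equals the cohort mean.

    counts: dict of sample name -> dict of gene name -> raw count
            (hypothetical layout, not taken from the original analysis)
    housekeeping: list of housekeeping gene names
    """
    totals = {s: sum(genes[h] for h in housekeeping) for s, genes in counts.items()}
    usable = [t for t in totals.values() if t > 0]
    if not usable:
        raise ValueError("no sample has any housekeeping signal")
    target = sum(usable) / len(usable)

    normalised = {}
    for sample, genes in counts.items():
        if totals[sample] == 0:
            continue  # no housekeeping reads: the sample cannot be normalised
        factor = target / totals[sample]
        normalised[sample] = {g: c * factor for g, c in genes.items()}
    return normalised


# Toy data: s2 has half the housekeeping signal of s1.
counts = {
    "s1": {"HK1": 100, "HK2": 300, "GENE_A": 40},
    "s2": {"HK1": 50, "HK2": 150, "GENE_A": 10},
}
norm = sum_normalise(counts, ["HK1", "HK2"])
print(norm)
```

After normalisation both toy samples have the same housekeeping total, so the remaining genes are compared on a common scale.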
Need to establish an appropriate threshold of expression above which replicability is acceptable. Look at the samples for which two Nanostring replicates are available.
And looking at the residual variation:
Anywhere above 5-10 reads starts to look somewhat reasonable here.
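One way to put a number on the replicate comparison is to bin genes by their mean count across the two replicates and summarise the disagreement within each bin. The Python sketch below does this with the median absolute log2 ratio; the bin edges, the pseudocount of 1 and the function name `replicate_disagreement` are assumptions chosen for illustration, not values from the analysis.

```python
import math
from statistics import median


def replicate_disagreement(rep_a, rep_b, bin_edges=(0, 5, 10, 50, 200)):
    """Median |log2 ratio| between two replicates, grouped by mean count.

    rep_a, rep_b: equal-length lists of non-negative counts for the same genes.
    bin_edges: lower edges of the expression bins (must include 0).
    """
    bins = {}
    for a, b in zip(rep_a, rep_b):
        mean = (a + b) / 2
        # Pseudocount of 1 avoids taking the log of zero.
        diff = abs(math.log2((a + 1) / (b + 1)))
        # Lower edge of the bin that the mean count falls into.
        lower = max(e for e in bin_edges if e <= mean)
        bins.setdefault(lower, []).append(diff)
    return {lower: median(d) for lower, d in sorted(bins.items())}


# Toy replicate pair: low counts disagree more than high counts.
rep_a = [1, 4, 8, 100, 400]
rep_b = [4, 1, 6, 110, 380]
result = replicate_disagreement(rep_a, rep_b)
for lower, value in result.items():
    print(f"mean count >= {lower}: median |log2 ratio| = {value:.2f}")
```

The bin at which the disagreement levels off is a candidate for the minimum read threshold.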
Where should we choose to place our cutoffs?
To place this into context, we need to ask how strongly expressed our housekeeping genes are.
Another measure of sample quality is the set of ratios between housekeeping gene expression levels.
Two extreme outlier samples stand out. What if we remove them?
PCA changes drastically! Any housekeeping gene ratio method will need to remove these at least (these are strongly expressed and so won’t be cut out by other methods).
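As an illustration of how such ratio outliers might be picked out automatically, the Python sketch below computes the log2 ratio for every pair of housekeeping genes, compares each sample with the cohort median for that pair, and scores a sample by its largest deviation. The scoring rule, the pseudocount and the names (`ratio_deviation`, `HK1`, `odd`, ...) are assumptions for the example rather than the method used here.

```python
import math
from itertools import combinations
from statistics import median


def ratio_deviation(hk_counts, pseudocount=1.0):
    """Largest deviation of each sample's housekeeping log-ratios from the median.

    hk_counts: dict of sample name -> dict of housekeeping gene -> count
               (hypothetical layout; every sample must list the same genes)
    """
    genes = sorted(next(iter(hk_counts.values())))
    scores = {s: 0.0 for s in hk_counts}
    for g1, g2 in combinations(genes, 2):
        log_ratio = {
            s: math.log2((c[g1] + pseudocount) / (c[g2] + pseudocount))
            for s, c in hk_counts.items()
        }
        centre = median(log_ratio.values())
        for s, r in log_ratio.items():
            scores[s] = max(scores[s], abs(r - centre))
    return scores


# Toy data: "odd" is strongly expressed but has an inverted HK1/HK2 ratio.
hk = {
    "s1": {"HK1": 100, "HK2": 200, "HK3": 50},
    "s2": {"HK1": 110, "HK2": 210, "HK3": 55},
    "s3": {"HK1": 90, "HK2": 190, "HK3": 45},
    "odd": {"HK1": 400, "HK2": 20, "HK3": 50},
}
scores = ratio_deviation(hk)
worst = max(scores, key=scores.get)
print(worst, round(scores[worst], 2))
```

Because the score depends only on ratios, a strongly expressed sample with distorted proportions is still caught, which an expression-level cutoff alone would miss.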
If low-quality samples don’t show the same correlations between housekeeping gene expression levels, we can exclude those which don’t fit the pattern.
Use chi-square tests to compare the proportional expression of the housekeeping genes in each sample against the average ratios.
Results are very significant, but we don’t want to be so stringent as to remove every sample with a statistically sound deviation from the average ratios. We just need to remove the worst samples.
Here, we can simply add a cut-off on the basis of the chi-square statistic. Note that our previous “weird outliers” are thrown out by this measure.
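A per-sample chi-square statistic of this kind can be sketched in Python as follows. The expected counts are taken to be the sample's housekeeping total multiplied by the mean per-sample proportion of each gene; how the "average ratios" were actually computed is not stated, so that choice, the function name `chi_square_scores` and the toy data are assumptions.

```python
def chi_square_scores(hk_counts):
    """Chi-square statistic of each sample's housekeeping counts vs. average proportions.

    hk_counts: dict of sample name -> dict of housekeeping gene -> count
               (hypothetical layout; every sample must list the same genes)
    """
    genes = sorted(next(iter(hk_counts.values())))
    totals = {s: sum(c[g] for g in genes) for s, c in hk_counts.items()}

    # Per-sample proportions, skipping samples with no housekeeping reads.
    props = [
        [c[g] / totals[s] for g in genes]
        for s, c in hk_counts.items()
        if totals[s] > 0
    ]
    mean_prop = [sum(p[i] for p in props) / len(props) for i in range(len(genes))]

    scores = {}
    for s, c in hk_counts.items():
        if totals[s] == 0:
            continue
        stat = 0.0
        for g, p in zip(genes, mean_prop):
            expected = totals[s] * p
            if expected > 0:
                stat += (c[g] - expected) ** 2 / expected
        scores[s] = stat
    return scores


hk = {
    "s1": {"HK1": 100, "HK2": 200, "HK3": 50},
    "s2": {"HK1": 110, "HK2": 210, "HK3": 55},
    "s3": {"HK1": 90, "HK2": 190, "HK3": 45},
    "odd": {"HK1": 400, "HK2": 20, "HK3": 50},
}
chi = chi_square_scores(hk)
for sample, stat in sorted(chi.items(), key=lambda kv: -kv[1]):
    print(f"{sample}: {stat:.1f}")
```

For a fixed proportional deviation this statistic grows with the total number of reads, so a fixed numeric cut-off is tied to the count scale of the data it was tuned on.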
Compute the difference between two probability distributions.
The top few of these are also fairly poor samples that don’t fit expected patterns
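The notes do not say which measure of difference between distributions was used. Purely as an assumption, the Python sketch below uses the Jensen-Shannon divergence (base 2, so it lies between 0 and 1) between a sample's housekeeping proportions and a reference set of proportions; the function names and the toy numbers are hypothetical.

```python
import math


def to_proportions(counts):
    """Convert a list of counts into proportions that sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot form proportions from all-zero counts")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and well defined when entries are zero."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


reference = to_proportions([100, 200, 50])   # e.g. average housekeeping profile
typical = to_proportions([110, 210, 55])     # close to the reference
distorted = to_proportions([400, 20, 50])    # very different proportions

d_typical = js_divergence(typical, reference)
d_distorted = js_divergence(distorted, reference)
print(f"typical: {d_typical:.4f}, distorted: {d_distorted:.4f}")
```

Ranking samples by such a divergence and inspecting the top few gives a depth-independent complement to the chi-square statistic.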
I propose the following cutoffs:
-Sample unflagged by internal controls
-At least 4 housekeeping genes expressed with at least 3 reads each
-Chi-stat under 1000 for housekeeping gene ratios
This is quite stringent, cutting samples available down to:
## [1] 111
(Out of 180 samples)
Not bad, but needs to be optimised further. This may be overly stringent and have cut the training dataset down too much.
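The three proposed cutoffs can be combined into a single filter, which makes it easy to vary the thresholds when optimising. This is a minimal Python sketch; the record fields (`flagged`, `hk_counts`, `chi_stat`) and the function name `passes_qc` are hypothetical, while the default thresholds are the ones proposed above.

```python
def passes_qc(sample, min_genes=4, min_reads=3, max_chi=1000):
    """Apply the three proposed cutoffs to one sample record.

    sample: dict with hypothetical keys
        "flagged"   - True if the internal controls flagged the sample
        "hk_counts" - dict of housekeeping gene -> read count
        "chi_stat"  - chi-square statistic for the housekeeping gene ratios
    """
    expressed = sum(1 for c in sample["hk_counts"].values() if c >= min_reads)
    return (
        not sample["flagged"]
        and expressed >= min_genes
        and sample["chi_stat"] < max_chi
    )


samples = {
    "good": {"flagged": False, "chi_stat": 120.0,
             "hk_counts": {"HK1": 50, "HK2": 80, "HK3": 12, "HK4": 7}},
    "flagged": {"flagged": True, "chi_stat": 120.0,
                "hk_counts": {"HK1": 50, "HK2": 80, "HK3": 12, "HK4": 7}},
    "low_expr": {"flagged": False, "chi_stat": 120.0,
                 "hk_counts": {"HK1": 50, "HK2": 2, "HK3": 1, "HK4": 7}},
    "bad_ratio": {"flagged": False, "chi_stat": 2500.0,
                  "hk_counts": {"HK1": 500, "HK2": 8, "HK3": 12, "HK4": 7}},
}
kept = [name for name, s in samples.items() if passes_qc(s)]
print(len(kept), "of", len(samples), "samples kept:", kept)
```

Re-running the filter with looser values of `min_reads` or `max_chi` shows how many samples each threshold costs, which is one way to check whether the training dataset is being cut down too far.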